About the Canopy clustering algorithm

Canopy clustering algorithm: English Wikipedia edition

Canopy clustering algorithm
The canopy clustering algorithm is an unsupervised pre-clustering algorithm introduced by Andrew McCallum, Kamal Nigam and Lyle Ungar in 2000. It is often used as a preprocessing step for the K-means algorithm or the hierarchical clustering algorithm. It is intended to speed up clustering operations on large data sets, where using another algorithm directly may be impractical due to the size of the data set.
The algorithm proceeds as follows, using two thresholds $T_1$ (the loose distance) and $T_2$ (the tight distance), where $T_1 > T_2$.〔McCallum, A.; Nigam, K.; and Ungar, L.H. (2000) "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching", Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 169-178.〕〔http://courses.cs.washington.edu/courses/cse590q/04au/slides/DannyMcCallumKDD00.ppt Retrieved 2014-09-06.〕
# Begin with the set of data points to be clustered.
# Remove a point from the set, beginning a new 'canopy'.
# For each point left in the set, assign it to the new canopy if its distance to the first point of the canopy is less than the loose distance $T_1$.
# If the distance of the point is additionally less than the tight distance $T_2$, remove it from the original set.
# Repeat from step 2 until there are no more data points in the set to cluster.
# These relatively cheaply clustered canopies can be sub-clustered using a more expensive but accurate algorithm.
Note that an individual data point may belong to several canopies. As an additional speed-up, a fast, approximate distance metric can be used for step 3, while a slower, more accurate distance metric can be used for step 4.
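The numbered steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `canopy_clustering`, the choice to always take the first remaining point as the new canopy's starting point, and the use of a single Euclidean distance for both the loose and the tight test are assumptions made here for brevity.

```python
import math


def canopy_clustering(points, t1, t2, distance=math.dist):
    """Group points into (possibly overlapping) canopies.

    Returns a list of (center_index, member_indices) tuples.
    t1 is the loose distance and t2 the tight distance, with t1 > t2.
    """
    if not t1 > t2:
        raise ValueError("t1 (loose) must be greater than t2 (tight)")

    remaining = list(range(len(points)))  # step 1: all points are candidates
    canopies = []
    while remaining:  # step 5: repeat until the set is empty
        # Step 2: remove a point and start a new canopy around it.
        # Taking the first remaining point is an assumption; any choice works.
        center = remaining.pop(0)
        members = [center]
        still_remaining = []
        for i in remaining:
            d = distance(points[center], points[i])
            if d < t1:
                members.append(i)  # step 3: within the loose distance
            if d >= t2:
                # Step 4: only points within the tight distance are removed,
                # so everything else stays available for later canopies.
                still_remaining.append(i)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies
```

For example, calling `canopy_clustering([(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (10.0, 0.0), (11.0, 0.0)], t1=3.0, t2=1.5)` yields three canopies, and the point at index 2 appears in two of them: it lies within the loose distance of the first canopy's starting point but outside the tight distance, so it remains in the set and later starts a canopy of its own.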
Since the algorithm uses distance functions and requires the specification of distance thresholds, its applicability to high-dimensional data is limited by the curse of dimensionality. Only when a cheap, approximate, low-dimensional distance function is available will the produced canopies preserve the clusters produced by K-means.
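One way to see why fixed thresholds become hard to choose in high dimensions is to measure how much distances vary around a query point. The sketch below is illustrative only: the helper `relative_contrast`, the uniform random data, and the sample size are assumptions, not part of the algorithm. It computes (max distance - min distance) / min distance; when this value is small, nearly all points are about equally far away, so no pair of thresholds separates them well.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Spread of distances from one random query point to n_points others."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

On uniformly random data the contrast is typically large in two dimensions and much smaller in a thousand, which is the distance-concentration effect behind the limitation described above.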
==Benefits==

* The number of instances of training data that must be compared at each step is reduced.
* There is some evidence that the resulting clusters are improved.〔Mahout description of Canopy-Clustering. Retrieved 2011-04-02.〕
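The first benefit can be made concrete by counting comparisons. The sketch below assumes that canopies are given as lists of point indices and that the more expensive algorithm only compares points sharing at least one canopy; the helper names `candidate_pairs` and `all_pairs_count` are hypothetical.

```python
from itertools import combinations


def candidate_pairs(canopies):
    """Return the set of index pairs that share at least one canopy."""
    pairs = set()
    for members in canopies:
        # Sorting makes each pair (i, j) canonical with i < j, so a pair
        # that shares several canopies is still counted once.
        for i, j in combinations(sorted(members), 2):
            pairs.add((i, j))
    return pairs


def all_pairs_count(n):
    """Number of comparisons needed if every point is compared to every other."""
    return n * (n - 1) // 2
```

With five points grouped into the canopies `[[0, 1, 2], [2], [3, 4]]`, `candidate_pairs` returns 4 pairs, whereas `all_pairs_count(5)` is 10. The saving grows with the data set as long as the canopies stay small relative to the whole.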

Excerpt source: Wikipedia, the free encyclopedia